giCentre - sequenceView

VAST 2010 Challenge
Genetic Sequences – Tracing the Mutations of a Disease

Authors and Affiliations

Aidan Slingsby, City University London, a.slingsby@soi.city.ac.uk [PRIMARY contact]
Jo Wood, City University London, jwo@soi.city.ac.uk
Jason Dykes, City University London, jad7@soi.city.ac.uk

Tool(s)

sequenceView was built in a couple of days using Processing - a set of Java libraries for rapidly designing and producing graphical sketches. The giCentre's long experience of using Processing in this way makes such development rapid.

The screen is split into three parts (vertically):

Native sequences (top).
Current outbreaks (middle), where disease characteristics are shown left of the DNA sequence as coloured squares, from left to right: symptom severity (blue), mortality (green), complications (orange), drug resistance (blue/purple), vulnerability (red) and the sum of these as a measure of overall seriousness (greyscale), where dark is more severe.
'Common sequences' (bottom): identified by the software on request (see below).

Bases are identified by hue (optionally labelled).

Interactions

Toggle DNA selection (left click)
Left/right scroll and zoom (keyboard)
Sort current outbreaks on characteristics (keyboard)
Request longest common sequences of all selected DNA (key press)
Draw horizontal line (right click)
Hide unselected DNA (key press)

The only automated function is computing the longest 'common sequences' in any DNA selection. This takes around 15 seconds. Matches of selected common sequences are identified in red. Common sequences can be hidden from view, leaving only those columns where at least one mutation has taken place in the originally-selected set of DNA, which - with this dataset - fit onto one screen.
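The idea of separating common sequences from mutated columns can be illustrated with a minimal sketch. It assumes the selected sequences are already aligned and of equal length, and it treats a 'common sequence' as a maximal run of columns on which every selected sequence agrees; the submission does not describe the algorithm sequenceView actually uses, so this is only one plausible reading. The function names and the toy sequences are hypothetical, and column indices here are 0-based.

```python
def variable_columns(seqs):
    """Indices of aligned columns where the sequences do not all agree."""
    return [i for i, col in enumerate(zip(*seqs)) if len(set(col)) > 1]


def common_runs(seqs):
    """Half-open (start, end) ranges of maximal runs of identical columns."""
    runs, start = [], None
    for i, col in enumerate(zip(*seqs)):
        if len(set(col)) == 1:
            if start is None:
                start = i          # a new common run begins here
        elif start is not None:
            runs.append((start, i))  # a mutated column closes the run
            start = None
    if start is not None:
        runs.append((start, len(seqs[0])))
    return runs


# Toy, made-up selection (not challenge data).
selection = ["ACGTAC", "ACGTTC", "ACCTAC"]
mutated = variable_columns(selection)
shared = common_runs(selection)
```

Hiding the ranges in `shared` and drawing only the columns in `mutated` corresponds to the compact view described above.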

Interpretation is left to the human. SequenceView supports the user in doing this by making effective use of alignment, sorting and interaction. SequenceView is responsive and supporting information is provided to the user quickly.

Video

Link to video (18Mb)

Answers

MC3.1: What is the region or country of origin for the current outbreak? Please provide your answer as the name of the native viral strain along with a brief explanation.

Nigeria_B, because more of its DNA is in common with the current outbreaks than that of any other native strain. Figure 2 shows an example of a long sequence found (red) in Nigeria_B but not in other native strains. Zooming out (Figure 1) or scrolling (video) is quick and shows that this pattern occurs throughout.

Steps to screenshots: The screenshots were arrived at by (a) selecting all current outbreaks; (b) requesting the longest common sequences within these (the only automated function); and (c) selecting all the resulting common sequences (matches are then coloured red). This and scrolling through the sequences took a couple of minutes. Central Africa and Cameroon also share significant DNA with the outbreaks.

Identifying DNA sequences common to the outbreaks in the native strains and keeping DNA sequences aligned were key to answering this question.
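The visual comparison could, in principle, be backed by a simple count. The sketch below assumes aligned, equal-length sequences and scores each native strain by the number of columns where all outbreaks agree and the native strain carries that same base. This is an illustrative stand-in for the visual judgement made in sequenceView, not the method used in the submission; the strain names and sequences are made up.

```python
def shared_with_outbreaks(native, outbreaks):
    """Count columns where all outbreaks agree and the native strain matches."""
    return sum(
        1
        for base, col in zip(native, zip(*outbreaks))
        if len(set(col)) == 1 and col[0] == base
    )


# Toy, made-up data (hypothetical strain names, not challenge data).
natives = {"strain_X": "ACGTAC", "strain_Y": "ACGAAG"}
outbreaks = ["ACGTAC", "ACGTTC"]

scores = {name: shared_with_outbreaks(seq, outbreaks) for name, seq in natives.items()}
best = max(scores, key=scores.get)  # native strain sharing the most DNA
```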

Figure 1: Zoomed-out view showing native sequences (top), outbreaks (middle) and longest common sequences found in the outbreaks (bottom). Common sequences are identified in red and those over which the mouse is positioned are coloured yellow.

Figure 2: Section of the DNA sequences showing similarity with Nigeria_B.

MC3.2: Over time, the virus spreads and the diversity of the virus increases as it mutates. Two patients infected with the Drafa virus are in the same hospital as Nicolai. Nicolai has a strain identified by sequence 583. One patient has a strain identified by sequence 123 and the other has a strain identified by sequence 51. Assume only a single viral strain is in each patient.

Which patient likely contracted the illness from Nicolai and why?

The patient with strain 123, because it is more similar to Nicolai's strain (583). Only one base is different (at position 269) as opposed to three bases for the other strain (Figure 4).

Steps to screenshots: (a) The three strains were selected (numerically ordered); (b) common sequences computed; (c) these were selected (matching sequences are highlighted in red); (d) non-selected sequences were hidden (Figure 3); and (e) common sequences were hidden leaving just columns in which at least one mutation had taken place (Figure 4).

Hiding common sequences left visible only those columns in which a mutation had occurred, which was key to answering the question.
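The comparison behind this answer amounts to counting the positions at which two aligned strains differ. A minimal sketch follows, assuming aligned sequences of equal length; the function name, the patient labels and the sequences are hypothetical toy values, and indices are 0-based.

```python
def differing_positions(a, b):
    """Positions at which two aligned sequences carry different bases."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Toy, made-up sequences (not the challenge strains).
source = "ACGTAC"
candidates = {"patient_1": "ACGTCC", "patient_2": "TCGAAG"}

diffs = {name: differing_positions(source, seq) for name, seq in candidates.items()}
closest = min(diffs, key=lambda name: len(diffs[name]))  # fewest substitutions
```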

Figure 3: (Zoomed-in) view showing all the native sequences (top), just 51, 123 and 583 of the outbreaks and the common sequences (bottom). Common sequences are red.

Figure 4: As figure 3, but with common sequences hidden, and with the base labels showing.

MC3.3: Signs and symptoms of the Drafa virus are varied and humans react differently to infection. Some mutant strains from the current outbreak have been reported as being worse than others for the patients that come in contact with them.

Identify the top 3 mutations that lead to an increase in symptom severity (a disease characteristic).

A>C, 269 - severe x4.
A>T, 946 - severe x9; moderate x5 (842 is highly correlated)
G>C, 212 - severe x3; moderate x1.

Outbreaks are sorted in order of severity; within each severity category, they are further sorted by the bases in the column under the mouse cursor (compare Figures 5 and 6).

Assumptions: (a) a mutation affecting only one strain is not a 'top' mutation; (b) mutations should apply to different sequences; (c) all severe DNA sequences should be covered; (d) mutation correlation does not imply all are needed.

Steps to screenshots: (a) select all outbreaks; (b) find common sequences; (c) select these; (d) sort on severity; (e) additionally sorting by bases in various columns.

The ability to sort on severity and on the bases in any column, to hide common sequences (isolating mutations), and the column index tooltip were key to answering this.
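The two-level ordering (characteristic first, then the base in a chosen column) can be sketched as a sort on a compound key. The sketch assumes each characteristic is encoded as an integer where larger means worse, which is an assumption about the data encoding; the record layout, the function name and the toy strains are hypothetical.

```python
def sort_strains(strains, characteristic, column):
    """Sort worst-first on a characteristic, then by the base in `column`."""
    return sorted(strains, key=lambda s: (-s[characteristic], s["seq"][column]))


# Toy, made-up strains (not challenge data); severity: larger is worse.
strains = [
    {"id": "a", "severity": 1, "seq": "ACGT"},
    {"id": "b", "severity": 2, "seq": "ATGT"},
    {"id": "c", "severity": 2, "seq": "ACGT"},
    {"id": "d", "severity": 1, "seq": "ATGT"},
]

ordered = [s["id"] for s in sort_strains(strains, "severity", 1)]
```

Within each severity level, strains with the same base in the chosen column end up adjacent, which is what makes the proportion of each base easy to read off visually.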

Figure 5: Sorted by symptom severity and bases on column 269.

Figure 6: As above, but sorted by 223.

MC3.4: Due to the rapid spread of the virus and limited resources, medical personnel would like to focus on treatments and quarantine procedures for the worst of the mutant strains from the current outbreak, not just symptoms as in the previous question. To find the most dangerous viral mutants, experts are monitoring multiple disease characteristics.

Consider each virulence and drug resistance characteristic as equally important. Identify the top 3 mutations that lead to the most dangerous viral strains. The mutations involve one or more base substitutions. In a worst case scenario, a very dangerous strain could cause severe symptoms, have high mortality, cause major complications, exhibit resistance to anti viral drugs, and target high risk groups.

T>C, 842 - well correlated with 946 (figure 7). Strongly affects vulnerability (see below)
T>C, 790 - not as severe as 842, but affects different DNA sequences. Complementary to the mutation above
A>T, 955 - only 4 base mutations and not particularly severe, but complementary to both mutations above (affects different strains)

We used the same technique and assumptions as above, but we sorted the strains on seriousness (sum of the disease characteristics; the grey column). This gave equal weight to these characteristics. Other characteristics could be explored by using the appropriate hue-based lightness (left of the strain) and by sorting by these. White-space is inserted between categories.

Within the sorted category, columns could be sorted on the bases of the column under which the mouse cursor was positioned, spatially consolidating bases of the same type. This helped identify correlation across strains and allowed the proportion of bases in a particular category to be estimated/counted more effectively.

Seriousness was highly correlated with severity. We could have chosen those mutations that affected just the top two or three seriousness categories, but some of these were answers to the previous question (i.e. strongly associated with severity). Also, since only three strains were in the most serious group, the sample sizes were quite small (see assumptions in previous answer). So instead we opted for complementary mutations (affecting different strains) which together covered most of the top half of serious cases.
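The equal weighting can be made concrete with a minimal sketch in which seriousness is the unweighted sum of the five characteristics. It assumes each characteristic is encoded as a small integer where larger means worse; that encoding, the field names and the toy strains are assumptions for illustration only.

```python
CHARACTERISTICS = (
    "severity",
    "mortality",
    "complications",
    "drug_resistance",
    "vulnerability",
)


def seriousness(strain):
    """Unweighted sum of the five disease characteristics."""
    return sum(strain[c] for c in CHARACTERISTICS)


# Toy, made-up strains (not challenge data); levels: larger is worse.
strains = [
    {"id": "s1", "severity": 2, "mortality": 1, "complications": 0,
     "drug_resistance": 1, "vulnerability": 1},
    {"id": "s2", "severity": 0, "mortality": 0, "complications": 1,
     "drug_resistance": 0, "vulnerability": 0},
    {"id": "s3", "severity": 1, "mortality": 1, "complications": 1,
     "drug_resistance": 1, "vulnerability": 0},
]

ranked = [s["id"] for s in sorted(strains, key=seriousness, reverse=True)]
```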

We did not find evidence that more than one mutation on the same DNA sequence was necessary. Correlated mutations are candidates, but we decided that the evidence was circumstantial.

We can also explore other important characteristics of the disease, visually.

Mortality: the mutation A>C at 269 (identified as leading to severe forms) is associated with the highest mortality levels.
Complications: we noticed that about half of major complications tend to be associated with less serious strains. In Figure 9, we sort by complications and can identify that A>G at 223 is a key mutation (this was a candidate for the previous question). Columns 720 and 821 also contain mutations associated with major complications.
Drug resistance: Figure 10 (sorted by drug resistance) shows that C>G, 22 is associated with an increase in drug resistance.
Vulnerability: the mutations at 212 (G>C) and 842 (T>C) appear to be important for the susceptibility of 'at-risk' sections of the population.

Mutations that change different characteristics of strains of the outbreak will require different planning responses by health authorities. Identifying key mutations as the pandemic continues is therefore essential for managing the response.

Metrics

Metrics could be computed for each mutation (e.g. number of mutations weighted by seriousness), and we would advocate implementing some to assist in data interpretation in future - they could, for example, be used to identify likely DNA candidates for a particular criterion or be used as a basis for sorting.

We did find, however, that interpreting the data visually was essential and any further metrics should support rather than replace this. For example, we easily confirmed that mutations were substitutions rather than insertions, by looking for offsets and not finding any. Sorting and alignment using the gaps between disease characteristic categories and horizontal line placement (right click; figure 8) were particularly important. For example, the mutation at 955 does not look significant in isolation, but it is when considered in the context of the other mutations (they are complementary).
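One such metric might look like the following sketch, which was not part of sequenceView. It assumes aligned sequences, a reference sequence against which mutations are defined (for example a native strain), and a numeric seriousness value per outbreak strain. For each column it sums the seriousness of the strains whose base differs from the reference. All names and values are hypothetical toy data.

```python
def weighted_mutation_scores(reference, strains):
    """Map column index -> summed seriousness of strains mutated there."""
    scores = {}
    for seq, weight in strains:
        for i, (ref_base, base) in enumerate(zip(reference, seq)):
            if ref_base != base:
                scores[i] = scores.get(i, 0) + weight
    return scores


# Toy, made-up data: (sequence, seriousness) pairs, not challenge data.
reference = "ACGTAC"
strains = [("ACGTTC", 5), ("ACCTTC", 3), ("ACCTAC", 1)]

scores = weighted_mutation_scores(reference, strains)
candidates = sorted(scores, key=scores.get, reverse=True)  # highest score first
```

A ranking like `candidates` could serve as a basis for sorting columns, with the visual display still used to judge whether a high-scoring column is genuinely informative.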

Flexibility

Our approach of rapidly building a tool as part of the data exploration process using Processing enables us to incrementally add functionality where needed, including implementing new metrics or loading additional datasets. This allows the tool to grow in line with the depth of analysis required. SequenceView was built with this particular dataset in mind, but the functionality was designed to be as generic as possible and it should work with similar data in the same format.

SequenceView supports analysts by providing appropriate sorting, symbolism, data hiding, alignment, interaction and the function for automatically identifying the longest common sequences in any arbitrary selected set. This small set of basic functions supports the analyst in answering a wide set of questions about the DNA sequences.

Scalability

SequenceView was designed to accommodate this size of dataset in terms of computation time, graphical display and memory, and it is expected to work for datasets of similar size. In the current design all mutations fit on the same screen once the common sequences have been hidden. If there were many more mutations and/or longer DNA sequences, it might not be possible to fit them all on the screen at once. The screen size of bases could be reduced and scrolling could be used, but only if the requirement for scrolling was minimal, as scrolling would make visual analysis much more challenging. For datasets that are much larger, some redesign might be necessary: for example, allowing the user to hide any number of arbitrary columns not currently under consideration, or summarising/grouping mutations by disease characteristic category instead of showing all DNA sequences.

Programming

Building tools using Processing to answer such questions is efficient for us because of our experience. Processing is, however, designed for designers who are not necessarily programmers, to create 'sketches' like this quickly. We do not aim at this stage to build complete software tools, but rather to design and test visualisation and interaction ideas for visual analytics. However, we are building a reusable set of libraries that are publicly available. We (or somebody else) may implement those designs and ideas into a full piece of software later.

Figure 7: Sorted by overall seriousness.

Figure 8: Sorted by overall seriousness, with horizontal line showing that T base (mouse cursor) is complementary to 842 and 790 for this strain.

Figure 9: Sorted by complications, note likely key mutation (column 223).

Figure 10: Sorted by drug resistance, note likely key mutation (column 22).